scikit-learn docs provide a nice text classification tutorial. Make sure to read it first. We'll be doing something similar, while taking a more detailed look at classifier weights and predictions.
First, we need some data. Let's load the 20 Newsgroups data, keeping only 4 categories:
In [1]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42
)
A basic text processing pipeline - bag of words features and Logistic Regression as a classifier:
In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
vec = CountVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target);
We're using LogisticRegressionCV here to adjust the regularization parameter C automatically. This makes it possible to compare different vectorizers fairly - the optimal C value could be different for different input features (e.g. for bigrams or for character-level input). An alternative would be to use GridSearchCV or RandomizedSearchCV.
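For example, a roughly equivalent GridSearchCV setup could look like the sketch below (the C grid and cv=3 are arbitrary illustration choices, not part of this tutorial; it is also slower, because every candidate C refits the vectorizer as well):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# search over a hypothetical grid of C values, cross-validating the whole pipeline
grid_pipe = GridSearchCV(
    make_pipeline(CountVectorizer(), LogisticRegression()),
    param_grid={'logisticregression__C': [0.01, 0.1, 1.0, 10.0]},
    cv=3,
)
grid_pipe.fit(twenty_train.data, twenty_train.target);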
Let's check the quality of this pipeline:
In [3]:
from sklearn import metrics
def print_report(pipe):
    y_test = twenty_test.target
    y_pred = pipe.predict(twenty_test.data)
    report = metrics.classification_report(y_test, y_pred,
        target_names=twenty_test.target_names)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))
print_report(pipe)
Not bad. We can try other classifiers and preprocessing methods, but let's first check what the model has learned, using the eli5.show_weights function:
In [4]:
import eli5
eli5.show_weights(clf, top=10)
Out[4]:
The table above doesn't make any sense; the problem is that eli5 was not able to get feature and class names from the classifier object alone. We can provide feature and target names explicitly:
In [5]:
# eli5.show_weights(clf,
#                   feature_names=vec.get_feature_names(),
#                   target_names=twenty_test.target_names)
The code above works, but a better way is to provide the vectorizer instead and let eli5 figure out the details automatically:
In [6]:
eli5.show_weights(clf, vec=vec, top=10,
                  target_names=twenty_test.target_names)
Out[6]:
This starts to make more sense. Columns are target classes; in each column there are features and their weights. The intercept (bias) feature is shown as <BIAS> in the same table. We can inspect features and weights because we're using a bag-of-words vectorizer and a linear classifier (so there is a direct mapping between individual words and classifier coefficients). For other classifiers features can be harder to inspect.
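Under the hood this mapping is just clf.coef_ paired with the vectorizer's vocabulary. A minimal sketch of pulling out a few top coefficients manually (using vec.get_feature_names(), as elsewhere in this tutorial; newer scikit-learn versions call it get_feature_names_out()):

import numpy as np

feature_names = np.asarray(vec.get_feature_names())
for class_idx, class_name in enumerate(twenty_train.target_names):
    # clf.coef_ has shape (n_classes, n_features); a larger weight means
    # stronger evidence for that class in a linear model
    top = np.argsort(clf.coef_[class_idx])[::-1][:5]
    print(class_name, list(zip(feature_names[top], clf.coef_[class_idx][top].round(2))))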
Some features look good, but some don't. It seems the model learned some names specific to the dataset (parts of email addresses, etc.) instead of learning topic-specific words. Let's check prediction results on an example:
In [7]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names)
Out[7]:
What can be highlighted in the text is highlighted. There is also a separate table for features which can't be highlighted in the text - <BIAS> in this case. If you hover the mouse over a highlighted word, its weight is shown in a tooltip. Words are colored according to their weights.
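If you're working outside a notebook, a plain-text rendering of the same explanation can be produced with eli5's explain_prediction and format_as_text functions (a sketch):

expl = eli5.explain_prediction(clf, twenty_test.data[0], vec=vec,
                               target_names=twenty_test.target_names)
print(eli5.format_as_text(expl))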
Aha, from the highlighting above it can be seen that the classifier indeed learned some uninteresting things, e.g. it remembered parts of email addresses. We should probably clean the data first to make it more interesting; improving the model (trying different classifiers, etc.) doesn't make sense at this point - it may just learn to leverage these email addresses better.
In practice we'd have to do the cleaning ourselves; in this example the 20 Newsgroups dataset provides an option to remove footers and headers from the messages. Nice. Let's clean up the data and re-train the classifier.
In [8]:
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)
vec = CountVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target);
We just made the task harder and more realistic for a classifier.
In [9]:
print_report(pipe)
A great result - we just made quality worse! Does it mean the pipeline is worse now? No, it likely has better quality on unseen messages; it is the evaluation that is fairer now. Inspecting the features used by the classifier allowed us to notice a problem with the data and make a good change, despite the numbers telling us not to do it.
Instead of removing headers and footers we could have improved the evaluation setup directly, e.g. by using GroupKFold from scikit-learn. Then the quality of the old model would have dropped, we could have removed headers/footers and seen the accuracy increase, so the numbers would have told us to remove headers and footers. It is not obvious how to split the data though - which groups to use with GroupKFold.
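For illustration only, a sketch of what that could look like, under the assumption that grouping messages by sender is reasonable (get_sender below is a hypothetical helper, and the senders are taken from a re-fetched raw training set that still has its headers):

import re
from sklearn.model_selection import GroupKFold, cross_val_score

# same subset, order and categories as twenty_train, but with headers kept
raw_train = fetch_20newsgroups(subset='train', categories=categories,
                               shuffle=True, random_state=42)

def get_sender(text):
    # crude heuristic: use the "From:" header line as the group label
    match = re.search(r'^From:\s*(.+)$', text, re.MULTILINE)
    return match.group(1) if match else 'unknown'

groups = [get_sender(text) for text in raw_train.data]
scores = cross_val_score(pipe, twenty_train.data, twenty_train.target,
                         groups=groups, cv=GroupKFold(n_splits=5))
print(scores.mean())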
So, what has the updated classifier learned? (The output is less verbose because only a subset of classes is shown - see the "targets" argument):
In [10]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])
Out[10]:
Hm, it no longer uses email addresses, but it still doesn't look good: the classifier assigns high weights to seemingly unrelated words like 'do' or 'my'. These words appear in many texts, so maybe the classifier uses them as a proxy for bias. Or maybe some of them are more common in some of the classes.
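A quick way to check the last guess (a sketch, not part of the original pipeline): count how often these words occur on average per message in each class of the training data:

import numpy as np

check_vec = CountVectorizer(vocabulary=['do', 'my'])
counts = check_vec.fit_transform(twenty_train.data).toarray()
for class_idx, class_name in enumerate(twenty_train.target_names):
    mask = twenty_train.target == class_idx
    # mean counts of 'do' and 'my' per message in this class
    print(class_name, counts[mask].mean(axis=0).round(2))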
To help the classifier we may filter out stop words:
In [11]:
vec = CountVectorizer(stop_words='english')
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
In [12]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])
Out[12]:
Looks better, doesn't it?
Alternatively, we can use the TF*IDF scheme; it should give a somewhat similar effect.
Note that we're cross-validating the LogisticRegression regularization parameter here, like in the other examples (LogisticRegressionCV, not LogisticRegression). TF*IDF values are different from word count values, so the optimal C value can be different. We could draw a wrong conclusion if a classifier with a fixed regularization strength were used - the chosen C value could have worked better for one kind of data.
In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
In [14]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])
Out[14]:
It helped, but didn't have quite the same effect. Why not do both?
In [15]:
vec = TfidfVectorizer(stop_words='english')
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
In [16]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])
Out[16]:
We can also try analyzing characters instead of words, using char n-grams:
In [17]:
vec = TfidfVectorizer(stop_words='english', analyzer='char',
                      ngram_range=(3,5))
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
In [18]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names)
Out[18]:
It works, but quality is a bit worse. Also, it takes ages to train.
It looks like stop_words has no effect now - in fact, this is documented in the scikit-learn docs, so our stop_words='english' was useless. But at least it is now more obvious what the text looks like to a char ngram-based classifier. Grab a cup of tea and see how char_wb looks:
In [19]:
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,5))
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
In [20]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names)
Out[20]:
The result is similar, with some minor changes. Quality is better for an unknown reason; maybe cross-word dependencies are not that important.
To check that, we can try fitting word n-grams instead of char n-grams. But let's deal with efficiency first. To handle large vocabularies we can use HashingVectorizer from scikit-learn; to make training faster we can employ SGDClassifier:
In [21]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vec = HashingVectorizer(stop_words='english', ngram_range=(1,2))
clf = SGDClassifier(n_iter=10, random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
It was super fast! We're not choosing the regularization parameter using cross-validation though. Let's check what the model learned:
In [22]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])
Out[22]:
The result looks similar to what we got with CountVectorizer. But with HashingVectorizer we don't even have a vocabulary! Why does it work?
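Because of the hashing trick: each token is mapped to a column index by a hash function, so no vocabulary needs to be stored. A minimal sketch (n_features below is an arbitrary small value for illustration):

demo_vec = HashingVectorizer(n_features=2 ** 10)
X_demo = demo_vec.transform(["medicine doctor", "doctor"])
# the same token always hashes to the same column, so both documents
# get a non-zero value in the "doctor" column
print(X_demo[0].nonzero()[1], X_demo[1].nonzero()[1])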
In [23]:
eli5.show_weights(clf, vec=vec, top=10,
                  target_names=twenty_test.target_names)
Out[23]:
Ok, we don't have a vocabulary, so we don't have feature names. Are we out of luck? Nope, eli5 has an answer for that: InvertableHashingVectorizer. It can be used to get feature names for HashingVectorizer without fitting a huge vocabulary. It still needs some data to learn the words -> hashes mapping though; we can use a random subset of the data to fit it.
In [24]:
from eli5.sklearn import InvertableHashingVectorizer
import numpy as np
In [25]:
ivec = InvertableHashingVectorizer(vec)
sample_size = len(twenty_train.data) // 10
X_sample = np.random.choice(twenty_train.data, size=sample_size)
ivec.fit(X_sample);
In [26]:
eli5.show_weights(clf, vec=ivec, top=20,
                  target_names=twenty_test.target_names)
Out[26]:
There are collisions (hover the mouse over features with "..."), and there are important features which were not seen in the random sample (FEATURE[...]), but overall it looks fine.
The "rutgers edu" bigram feature is suspicious though; it looks like part of a URL.
In [27]:
rutgers_example = [x for x in twenty_train.data if 'rutgers' in x.lower()][0]
print(rutgers_example)
Yep, it looks like the model learned this address instead of learning something useful.
In [28]:
eli5.show_prediction(clf, rutgers_example, vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['soc.religion.christian'])
Out[28]:
Quoted text makes it too easy for the model to classify some of the messages; that won't generalize to new messages. So to improve the model, the next step could be to process the data further, e.g. remove quoted text or replace email addresses with a special token.
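For example, a cleaning step could look roughly like the sketch below (clean_message and EMAIL_RE are hypothetical helpers, not part of this tutorial):

import re

EMAIL_RE = re.compile(r'\S+@\S+')

def clean_message(text):
    # drop lines that look like quoted text (">" prefix is a common convention)
    lines = [line for line in text.splitlines()
             if not line.lstrip().startswith('>')]
    # replace anything that looks like an email address with a single token
    return EMAIL_RE.sub('EMAIL', '\n'.join(lines))

cleaned_train = [clean_message(text) for text in twenty_train.data]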
You get the idea: looking at features helps to understand how the classifier works. Maybe even more importantly, it helps to notice preprocessing bugs, data leaks, issues with the task specification - all the nasty problems you get in the real world.